Citation Segmentation from Sparse & Noisy Data: An Unsupervised Joint Inference Approach with Markov Logic Networks
Abstract
Citation Segmentation in a Digital Humanities Context. Bibliographies are an important resource for scientific research. Their storage in (online) bibliographic databases offers efficient search functionalities for widespread and timely use in international research communities. For this purpose, it is crucial to automatically detect the inherent structure of bibliographic references by isolating and extracting citation subfields (e.g., author, title, venue). Previous approaches to citation segmentation rely strongly on language-specific lexical data and on multiple occurrences of the same citation entry in online publication repositories. However, when dealing with multilingual data, the use of language-specific knowledge becomes difficult. Moreover, self-contained data sources like printed bibliographies naturally lack recurring citation entries, and thus cannot rely on data redundancy. In this work, we present an approach to citation segmentation that operates on sparse and noisy OCR input originating from a single, multilingual bibliography, the Turkology Annual (Turkologischer Anzeiger). The Turkology Annual is a bibliography for Turkology and Ottoman studies, comprising 28 volumes which are only available in print. Citation entries containing multiple languages and scripts, the shortage of citation redundancy, frequent OCR errors, and inconsistencies in citation structure impede the use of state-of-the-art statistical approaches for citation segmentation.
Similar resources
A Generalized Joint Inference Approach for Citation Matching
Citation matching is the problem of extracting bibliographic records from citation lists in technical papers, and merging records that represent the same publication. Generally, there are three types of datasets in citation matching, i.e., sparse, dense and hybrid types. Typical approaches for citation matching are Joint Segmentation (Jnt-Seg) and Joint Segmentation Entity Resolution (Jnt-Seg-E...
Joint Unsupervised Coreference Resolution with Markov Logic
Machine learning approaches to coreference resolution are typically supervised, and require expensive labeled data. Some unsupervised approaches have been proposed (e.g., Haghighi and Klein (2007)), but they are less accurate. In this paper, we present the first unsupervised approach that is competitive with supervised ones. This is made possible by performing joint inference across mentions, i...
Joint Inference in Information Extraction
The goal of information extraction is to extract database records from text or semi-structured sources. Traditionally, information extraction proceeds by first segmenting each candidate record separately, and then merging records that refer to the same entities. While computationally efficient, this approach is suboptimal, because it ignores the fact that segmenting one candidate record can hel...
Semantic analysis of spoken input using Markov logic networks
We present a semantic analysis technique for spoken input using Markov Logic Networks (MLNs). MLNs combine graphical models with first-order logic. They are particularly suitable for providing inference in the presence of inconsistent and incomplete data, which are typical of an automatic speech recognizer’s (ASR) output in the presence of degraded speech. The target application is a speech int...
Knowledge-leveraged Computational Thinking through Natural Language Processing and Statistical Logic (NII Shonan Meeting 2011-4)
This talk describes a recent effort on the development of a textual entailment data set. Rather than assuming a sub-component of applications like question answering and multi-document summarization, we focus on a real-world task to judge whether a natural language proposition is true or false according to a given text. I will describe the design of resource development and features of the obtai...